Complex Question Answering (CQA) is commonly used for answering community questions,
which require human knowledge to answer. An effective complex question answering system
is essential for reducing the complexity of the question answering process. In the present
work, we propose a Hierarchy based Firefly Optimized k-means Clustering (HFO-KC) method
for complex question answering. Initially, the given input query is preprocessed, which
eliminates misclassification when comparing strings. In order to enhance the answer
selection process, the obtained keywords are mapped into the candidate solutions.
After mapping, the obtained keywords are segmented. Each segment forms a new query for
answer selection, and a number of answers are selected for each query. Okapi BM25
similarity computation is utilized for document retrieval. The selected answers are then
classified with k-means clustering, which forms the hierarchy for each answer. Finally, the
firefly optimization algorithm is used for selecting the best quality answer from the
hierarchy.


1. INTRODUCTION

Semantic information published on the web has increased rapidly with the linked data initiative.
However, it is typically difficult for users to search and query the vast amount of structured and
heterogeneous semantic data [1]. It is essential to build a system that is able to answer questions from
different domains. Such a system is termed an open domain question answering system, and it should access
knowledge in a novel way [2]. The volume of stored data is high, which increases the burden of filtering
and browsing results to retrieve precise information. A question answering system is a technology used to
find, extract, and provide a proper answer to the user's query in natural language format [3].
Repositories are specially built for accomplishing tasks such as question answering, knowledge
mining and searching [4]. Data mining is a subfield of computer science that enables intelligent extraction of
useful information [5].

Because the data is large and growing, efficient and intuitive techniques are essential to deal
with it. The complexity and ease of interference are taken into account while processing the data [6].
Instead of requiring knowledge of the query language, the knowledge graph extracts the structure and relation
between the question and answer [7]. In addition to collaborative information seeking and sharing,
collaborative answers are also included. The community agreements among Question Answering (QA) pairs are
obtained through micro collaboration and the enhancement of collective intelligence [8]. The keywords from
the query are matched with the metadata, from which a sequence of answers is retrieved for the given query.

A semantic question answering system was developed in which the questions contain uncertain words.
A fuzzy ontology based system was developed by researchers at the text extraction level.
The characteristics of the data are analyzed to check the possibility of solving frequently posed questions [9].
The search facility is the main feature of CQA services, permitting members to search the archives.
Normally, information retrieval approaches are developed in which the member can construct and send
an arbitrary collection of questions until an old question matching the current need is obtained [10]. The reuse
of past QA pairs provides the benefit of enhancing user experience [11].

The efficiency of processing natural language questions is improved when heterogeneous data is
utilized as an answer source. The usage of a unique source is not straightforward because of pattern
variation [12]. When mapping the question to the semantic content of the knowledge base, in-depth information
is required [13]. Group based recommendations are developed with two techniques, namely aggregation of
interest profiles and aggregation of recommendation lists [14]. The terminology used in a natural language
(NL) question varies from the terminology used in the knowledge base. A solution for conceptual
disambiguation is essential for searching for matches in homogeneous or heterogeneous resources [15].

Machine learning paradigms have been developed recently for classifying, organizing and extracting
relevant information. Even though question classification has become more accurate, the comprehension of the
question answering system (QAS) must be made more understandable so that the correct answer is obtained
easily [16, 17]. It faces difficulties such as the linguistic gap between the documents and search queries and
the unavailability of recently posed questions. Hence, it is not possible to search CQA archives to answer
web queries [18]. The similarity between the question and matching words provides the extraction features
for the top ranked answer [19].

The remainder of this paper is organized as follows. Section 2 briefly explains the proposed
complex question answering method. Section 3 describes the research method, Hierarchy based Firefly
Optimized k-means Clustering (HFO-KC). In Section 4, the experimental results are analyzed. Section 5
discusses the significant aspects of the work and concludes.


2. THE PROPOSED METHOD 

In the proposed complex question answering system, the input query is first preprocessed.
After preprocessing, the keywords are obtained and segmented. For each segment, a number of answers
are extracted.

In order to select the correct answer for the given input query, the collected answers are classified
with k-means clustering and the best answer is selected using the firefly optimization algorithm. K-means is
one of the most promising and effective clustering algorithms [20]. Clustering plays a wide role in recent
developments in computer science [21]. In machine learning, classification is a supervised learning task,
whereas clustering is an unsupervised one [22]. The flow diagram of the proposed CQA system is shown
in Figure 1.

Figure 1. Optimal hierarchy based k means clustering for complex question answering


2.1. Preprocessing and Question Segmentation

Preprocessing is applied to the input query and the collection of documents. Initially,
the individual keywords are extracted and the stop words are removed. After stop word removal,
word lemmatization is applied to the remaining keywords. Each keyword is mapped to its corresponding
templates. After preprocessing, the input query contains $n$ tuples, $Q = \{q_1, q_2, \ldots, q_n\}$. Each
keyword is mapped to a set of templates denoted as $q_i = \{t_1, t_2, \ldots, t_m\}$. The templates are then
grouped to form segments, and a number of answers are selected for each segment. The block diagram for
preprocessing and question segmentation is shown in Figure 2.

Figure 2. Preprocessing and question segmentation
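
The preprocessing and segmentation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop word list `STOP_WORDS`, the suffix-stripping function `naive_lemma` (a crude stand-in for a real lemmatizer), and the fixed segment size are all assumptions made for illustration, and the template mapping is omitted because the paper does not specify the templates.

```python
import re

# Hypothetical, deliberately tiny stop word list (assumption for illustration).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
              "to", "for", "and", "or", "what", "which", "who", "how", "do", "does"}


def naive_lemma(word):
    """Crude suffix stripping used as a stand-in for a real lemmatizer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def preprocess(query):
    """Lowercase, tokenize, drop stop words, then lemmatize the remaining keywords."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]


def segment(keywords, size=2):
    """Group consecutive keywords into segments; each segment acts as a sub-query."""
    return [keywords[i:i + size] for i in range(0, len(keywords), size)]


keywords = preprocess("What are the largest cities of the European countries?")
segments = segment(keywords, size=2)
```

Each element of `segments` would then be issued as a separate query in the answer selection step. Grouping by a fixed size is only one possible segmentation rule.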


2.2. Answer Selection for the Given Input Query 

For the segmented questions, the answer is selected based on the Okapi-BM25 score, which is
computed for each answer. It selects an initial set of relevant answers based on similarity, and it can be
processed more efficiently than the cosine similarity measurement. The Okapi-BM25 score is computed using
the following formula:


$$\mathrm{Okapi}(Q, A_i) = \sum_{t \in Q \cap A_i} w^{(1)} \, \frac{(b_1 + 1)\, atf}{B + atf} \times \frac{(b_3 + 1)\, qtf}{b_3 + qtf} \qquad (1)$$

Where $Q$ represents the query, $A_i$ represents the $i$-th answer for the given query, $qtf$ is the question term
frequency, $atf$ is the answer term frequency, and $b_1$ and $b_3$ are constant parameters. The value of $B$
is computed as:





$$B = b_1 \left( (1 - c) + c \, \frac{al}{aval} \right) \qquad (2)$$

Where $c$ represents the constant parameter, $al$ represents the answer length and $aval$ represents the
average answer length. The weight value $w^{(1)}$ used in equation (1) is defined as





$$w^{(1)} = \log \frac{P - p + 0.5}{p + 0.5} \qquad (3)$$

Where $P$ is the number of answers and $p$ is the number of answers containing term $t$. The top relevant
answers are selected from the documents based on the Okapi-BM25 method [23]. These top relevant answers
are utilized for further processing in terms of document classification.
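
As an illustration of equations (1) to (3), the following Python sketch scores tokenized answers against a tokenized query. The function name `okapi_bm25`, the list-of-tokens data layout and the toy corpus are assumptions made for this example; the default parameter values are the ones reported in Section 4.

```python
import math


def okapi_bm25(query_terms, answer_terms, all_answers, b1=1.2, b3=7.0, c=0.75):
    """Score one tokenized answer against a tokenized query, following (1)-(3)."""
    P = len(all_answers)                            # number of answers
    aval = sum(len(a) for a in all_answers) / P     # average answer length
    al = len(answer_terms)                          # length of this answer
    B = b1 * ((1 - c) + c * al / aval)              # equation (2)
    score = 0.0
    for t in set(query_terms) & set(answer_terms):  # terms shared by Q and A_i
        p = sum(1 for a in all_answers if t in a)   # answers containing term t
        w = math.log((P - p + 0.5) / (p + 0.5))     # equation (3)
        atf = answer_terms.count(t)                 # answer term frequency
        qtf = query_terms.count(t)                  # question term frequency
        score += w * ((b1 + 1) * atf / (B + atf)) * ((b3 + 1) * qtf / (b3 + qtf))
    return score


# Toy corpus (hypothetical data, used only to exercise the function).
answers = [
    ["firefly", "algorithm", "optimization", "brightness"],
    ["k", "means", "clustering", "centroid"],
    ["okapi", "ranking", "function"],
    ["stop", "word", "removal"],
    ["euclidean", "distance", "metric"],
]
query = ["firefly", "optimization"]
scores = [okapi_bm25(query, a, answers) for a in answers]
ranked = sorted(range(len(answers)), key=lambda i: scores[i], reverse=True)
```

Answers that share no term with the query receive a score of zero. Note that the weight in (3) becomes zero or negative for a term that occurs in half or more of the answers, so very common terms do not raise the score.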


3. RESEARCH METHOD
This section describes the Hierarchy based Firefly Optimized k-means Clustering (HFO-KC) method
for complex question answering.


3.1. Answer Classification with hierarchical K-Nearest Neighbor 

Using the obtained scores, hierarchical K-Nearest Neighbor (KNN) is utilized to separate the top
relevant answers. The answers with the highest and lowest scores are separated into different groups.
The KNN classification is accomplished based on the centroid score, and each time a new group is
formed hierarchically. KNN is one of the simplest and most popular supervised learning algorithms for
classification [24]. The inputs taken by KNN are the value $k$ and the collection of answers used
for classification. The classification problem is solved using the number of nearest neighbors given by
the input parameter $k$. It is a straightforward approach for classification. For each group, the $k$ nearest
neighbors are computed based on the centroid value. Initially, the answers are randomly divided into two
groups. From each group, the centroid value is chosen based on the score. Then the distance between the
centroid value and each of the remaining tuples is computed. Each tuple is added to the group that yields
the smaller distance. The KNN algorithm for answer classification is described as follows.


Algorithm 1: KNN algorithm for answer classification

Input: Answer collection with scores, k value

Output: Classified set of answers

Step 1: The score of each answer is taken into consideration for answer selection.

Step 2: The answers are divided into k groups randomly.

Step 3: Select a centroid from each group.

Step 4: For each answer, compute the distance between the answer and each centroid.

Step 5: The answer is added to the group that produces the minimum distance when compared with
the other groups.

Step 6: Similarly, all the answers are added to their relevant groups.

Step 7: After dividing into k groups, the centroid value is selected again and a new group is formed.

Step 8: The process is repeated until the centroids remain the same in successive iterations.
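
The steps of Algorithm 1 can be sketched in Python as follows. As listed, the steps describe a centroid-based iteration, and the sketch follows them on one-dimensional answer scores. Taking the mean score of a group as its centroid, the function name `cluster_scores`, the fixed random seed and the iteration cap are assumptions made for illustration, not details given in the paper.

```python
import random


def cluster_scores(scores, k=2, seed=0, max_iter=100):
    """Group answer scores around k centroids, following Steps 1-8 of Algorithm 1."""
    rng = random.Random(seed)
    # Step 2: divide the answers into k groups randomly.
    shuffled = scores[:]
    rng.shuffle(shuffled)
    groups = [shuffled[i::k] for i in range(k)]
    centroids = None
    for _ in range(max_iter):
        # Steps 3 and 7: use the mean score of each group as its centroid (assumption).
        new_centroids = [sum(g) / len(g) if g else rng.choice(scores) for g in groups]
        # Step 8: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Steps 4-6: add every answer to the group with the nearest centroid.
        groups = [[] for _ in range(k)]
        for s in scores:
            nearest = min(range(k), key=lambda j: abs(s - centroids[j]))
            groups[nearest].append(s)
    return groups, centroids


# Hypothetical Okapi-BM25 scores for six candidate answers.
groups, centroids = cluster_scores([9.1, 8.7, 9.4, 1.2, 0.8, 1.5], k=2)
```

Applying `cluster_scores` again to one of the resulting groups would produce the next level of a hierarchy, which is one possible reading of the hierarchical grouping described above.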


The collection of answers can be considered as data points in $n$-dimensional space, where $n$ denotes
the number of attributes. In order to compute the distance between two data points, the Euclidean distance
is used. The Euclidean distance between data points $x$ and $y$ is calculated as

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (4)$$

Where $n$ represents the number of attributes in the data set, and $x_i$ and $y_i$ are the values of attribute $i$
in data tuples $x$ and $y$ respectively. Instead of the Euclidean distance, the Minkowski distance or the
Manhattan distance can also be used. The simplest case of this algorithm is attained by setting the value of
$k$ to one. A specific property of this algorithm is that it predicts continuous valued attributes instead of
categorical attributes.
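
The three distance measures mentioned above are related: the Minkowski distance of order $p$ reduces to the Manhattan distance for $p = 1$ and to the Euclidean distance of equation (4) for $p = 2$. A short Python sketch, with function names chosen here for illustration:

```python
def minkowski(x, y, p=2):
    """Minkowski distance of order p between two equal-length tuples."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    """Equation (4): Minkowski distance with p = 2."""
    return minkowski(x, y, p=2)


def manhattan(x, y):
    """Minkowski distance with p = 1."""
    return minkowski(x, y, p=1)
```

For the points (0, 0) and (3, 4), `euclidean` gives 5 and `manhattan` gives 7.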


3.2. Optimized Answer Selection with Firefly Algorithm 

After grouping the answers, the accurate answer relevant to each query is selected based on firefly
optimization. It is a meta-heuristic algorithm for finding the optimal solution of an optimization problem.
The concept behind firefly optimization is the flashing behavior of fireflies. The following assumptions are
made for firefly optimization:
a) All fireflies can be attracted by the other fireflies.
b) The attractiveness of a firefly is represented by its brightness. A firefly with lower brightness is
attracted by a firefly with higher brightness.
c) Fireflies having the same brightness move randomly.

The attractiveness of a firefly is calculated using the following function:


$$\beta(r) = \beta_0 e^{-\gamma r^2} \qquad (5)$$

Where $\beta_0$ is the attractiveness of the firefly when $r = 0$ and $\gamma$ is the light absorption coefficient.
The firefly's movement depends entirely on its attractiveness. Firefly $i$ moves towards firefly $j$ if and
only if the attractiveness of firefly $j$ is greater than that of firefly $i$. In that case, the movement is given
by the following formula:


$$x_{ik} = x_{ik} + \beta_0 e^{-\gamma r_{ij}^2} (x_{jk} - x_{ik}) + \alpha S_k (rand_k - 0.5) \qquad (6)$$

Here $x_{ik}$ and $x_{jk}$ are the values of attribute $k$ for fireflies $i$ and $j$, where $k$ takes values
$1, 2, \ldots, n$ and $n$ is the dimension of the data set. $rand_k$ is a random number between 0 and 1.
$\alpha$ is the randomization parameter, which decides how far to move and takes a value between 0 and 1.
$S_k$ is a scaling parameter which is calculated for each attribute as

$$S_k = |u_k - l_k| \qquad (7)$$

$u_k$ and $l_k$ are the upper bound and lower bound of attribute $k$ respectively. $r_{ij}$ is the distance
between fireflies $i$ and $j$, which is calculated from:

$$r_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2} \qquad (8)$$


The value of attractiveness in optimization problems is calculated using an objective function.
The standard firefly algorithm is given below:


Algorithm 2: Firefly optimization

Input: Objective function f(x) and algorithm parameters $\alpha$, $\beta_0$, and $\gamma$
Output: Position with the minimized function value

Step 1: Initialize the firefly population p randomly.

Step 2: Initialize the algorithm parameters $\alpha$, $\beta_0$, and $\gamma$.

Step 3: Calculate the fitness value of each firefly using the objective function f(x).
Step 4: while t < max generation
            for i = 1 : p
                for j = 1 : i
                    if f(x_j) < f(x_i)
                        move firefly i towards j using (6)
                        calculate the fitness value of all fireflies again
                    end if
                end for
            end for
        end while
Step 5: Rank the fireflies to find the current best firefly.
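
A compact Python sketch of Algorithm 2 for a minimization problem is given below, using the attractiveness of equation (5), the movement of equation (6), the scaling of equation (7) and the distance of equation (8). Several details are assumptions made for illustration rather than the authors' settings: the function name `firefly_minimize`, the default parameter values, comparing each firefly against every other firefly in the inner loop, clamping positions to the bounds, and re-evaluating only the firefly that moved. The sphere function is a stand-in objective; the paper does not state the objective used for answer selection.

```python
import math
import random


def firefly_minimize(f, bounds, pop=15, max_gen=50,
                     alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    """Minimize f over box bounds with a basic firefly algorithm."""
    rng = random.Random(seed)
    dim = len(bounds)
    scale = [abs(u - l) for l, u in bounds]            # S_k, equation (7)
    # Step 1: initialize the population randomly inside the bounds.
    X = [[rng.uniform(l, u) for l, u in bounds] for _ in range(pop)]
    # Step 3: evaluate the objective for every firefly.
    fit = [f(x) for x in X]
    # Step 4: main loop over generations.
    for _ in range(max_gen):
        for i in range(pop):
            for j in range(pop):
                if fit[j] < fit[i]:                    # firefly j is brighter than i
                    r2 = sum((X[i][k] - X[j][k]) ** 2 for k in range(dim))  # r_ij^2
                    beta = beta0 * math.exp(-gamma * r2)                    # equation (5)
                    for k in range(dim):               # equation (6)
                        step = (beta * (X[j][k] - X[i][k])
                                + alpha * scale[k] * (rng.random() - 0.5))
                        lo, hi = bounds[k]
                        X[i][k] = min(max(X[i][k] + step, lo), hi)  # clamp (assumption)
                    fit[i] = f(X[i])                   # re-evaluate the moved firefly
    # Step 5: rank the fireflies and return the best one.
    best = min(range(pop), key=lambda i: fit[i])
    return list(X[best]), fit[best]


def sphere(x):
    """Stand-in objective with its minimum at the origin."""
    return sum(v * v for v in x)


best_x, best_val = firefly_minimize(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
```

Because the current best firefly never moves within a pass (no other firefly is brighter), the best objective value in the population cannot get worse from one generation to the next.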


In the present paper, preprocessing is performed first, which makes further processing easier.
After preprocessing, the relevant answers are collected and classified with the KNN
classifier. Finally, in order to improve the classification accuracy and to find the correct answer,
the firefly optimization algorithm is used. In this CQA system, the processing complexity is reduced
with the help of simple algorithms. Compared with the existing literature, a trade-off between
complexity and accuracy is attained.


4. RESULTS AND DISCUSSION

The Factoid Q&A Corpus is used as the dataset in our work for complex question answering [25].
It consists of 1,714 factoid questions which were created manually. The answers to the questions were
collected from Carnegie Mellon University and the University of Pittsburgh between 2008 and 2010. For the
KNN algorithm, the $k$ value is set to 2 and the constant parameters are $b_1 = 1.2$, $c = 0.75$ and $b_3 = 7.0$.
The proposed HFO-KC is compared with existing approaches such as JAIST, ICRC and RCNN [26].
The performance metrics precision, recall, f-measure, accuracy and complexity are evaluated for the
proposed approach and compared with the existing approaches. The improved performance of the proposed
approach shows the efficiency of the technique.


4.1. Precision 

Precision is the ratio of correctly predicted positive observations to the total number of
predicted positive observations. The performance comparison of the proposed CQA is shown in
Figure 3 and Figure 4. In Figure 3, the precision value is compared by varying the number of documents over
300, 500, 700 and 1000. The precision value reaches near 1; that is, near optimal performance is obtained
with our proposed method. When the number of documents is 300, the precision values obtained for the
existing methods are 0.58, 0.57, 0.57, and 0.59. For 500 documents, the precision values are 0.55, 0.56, 0.55
and 0.58. When the number of documents is increased to 700 and 1000, the precision values of the existing
methods are 0.54, 0.53, 0.54, 0.56 and 0.53, 0.52, 0.53, 0.55 respectively. With the proposed algorithm, the
precision value improves to 0.99, 0.98, 0.96 and 0.94 for 300, 500, 700 and 1000 documents. The average
precision values computed by RCNN, ICRC, JAIST, A-ARC I and HFO-KC are 0.545, 0.545, 0.5475, 0.57
and 0.9675 as shown in Figure 4. The improved precision values show the efficiency of the proposed
approach.


4.2. Recall

Recall is the ratio of correctly predicted positive observations to the total number of actual positive
observations. The recall values obtained by RCNN are 0.56, 0.55, 0.54 and 0.53 for 300, 500, 700
and 1000 documents. For ICRC these values are 0.56, 0.5, 0.53 and 0.52. JAIST produces recall values of 0.57,
0.565, 0.56 and 0.555. The existing A-ARC I has recall values of 0.58, 0.57, 0.56 and 0.55. For our
proposed CQA system, the recall values produced are 0.93, 0.9, 0.89 and 0.87 as shown in Figure 5.
The average recall values computed by RCNN, ICRC, JAIST, A-ARC I and HFO-KC are 0.545, 0.53, 0.55,
0.57 and 0.9 as shown in Figure 6. When the number of answers is increased, the recall value is
reduced. For a smaller number of answers, the recall value obtained is high.


4.3. F-measure

The f-measure is the harmonic mean of precision and recall. For 300 documents,
RCNN, ICRC, JAIST, A-ARC I and HFO-KC have f-measure values of 0.55, 0.56, 0.57, 0.58 and 0.957.
For 500 documents, RCNN, ICRC, JAIST, A-ARC I and HFO-KC have f-measure values of 0.55, 0.555,
0.56, 0.57 and 0.941. For 700 documents, RCNN, ICRC, JAIST, A-ARC I and HFO-KC have
f-measure values of 0.545, 0.5, 0.55, 0.56 and 0.957. For 1000 documents, RCNN, ICRC, JAIST, A-ARC I
and HFO-KC have f-measure values of 0.54, 0.45, 0.54, 0.55 and 0.9 as shown in Figure 7. The average
f-measure values obtained by RCNN, ICRC, JAIST, A-ARC I and HFO-KC are 0.548, 0.516, 0.555, 0.565
and 0.928 as shown in Figure 8. The f-measure values obtained by the proposed method are high when
compared with the other existing approaches.


4.4. Accuracy

Accuracy is the ratio of correctly predicted observations to the total number of observations. The
accuracy of the proposed approach is evaluated and compared with the existing approaches. When compared
with the existing approaches, the accuracy of the proposed technique is high. The accuracy values obtained
for RCNN, ICRC, JAIST, A-ARC I and HFO-KC are 0.72, 0.68, 0.72, 0.76, and 0.991 for 300 documents. When
the number of documents is 500, the accuracy values obtained for RCNN, ICRC, JAIST, A-ARC I and HFO-KC
are 0.71, 0.67, 0.715, 0.75 and 0.982. For 700 documents, RCNN, ICRC, JAIST, A-ARC I and HFO-KC produce
0.705, 0.665, 0.71, 0.74 and 0.972. When the number of documents is increased to 1000, the accuracy is 0.7,
0.65, 0.7, 0.73 and 0.962. The improved performance obtained with our proposed approach is shown in
Figure 9. The average accuracy for the various CQA approaches is shown in Figure 10.
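
For reference, the four metrics defined in Sections 4.1 to 4.4 can be computed from confusion-matrix counts as in the following Python sketch. The function name and the example counts are hypothetical and are not taken from the reported experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f_measure, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct positives / predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # correct positives / actual positives
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)       # correct predictions / all observations
    return precision, recall, f_measure, accuracy


# Hypothetical counts: 90 true positives, 10 false positives,
# 30 false negatives, 70 true negatives.
precision, recall, f_measure, accuracy = classification_metrics(90, 10, 30, 70)
```

With these counts, precision is 0.9, recall is 0.75 and accuracy is 0.8; the f-measure lies between precision and recall, closer to the smaller of the two.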

The proposed HFO-KC complex question answering system has a complexity
of $O(ndk) + O(m^2 t)$, where $d$ represents the dimension of each answer, $n$ represents the cardinality of
the document, $m$ represents the population size, $t$ is the number of iterations and $k$ represents the
number of groups used in the KNN algorithm. The computation time for the proposed work is 15 ms. The
proposed HFO-KC approach for complex question answering is evaluated with the performance metrics
precision, recall, accuracy, f-measure and complexity. When compared with the existing approaches, the
performance of the proposed approach is high. The proposed approach provides a trade-off between
complexity and accuracy.


5. CONCLUSION 

In this paper, the input query is first preprocessed, which includes stop word removal and word
lemmatization. Individual keywords are then extracted from the query and segmented. The segmentation is
performed on the collection of keywords. The candidate solutions are mapped from the obtained keywords.
The correct answer is retrieved from the database using the segmented query, based on Okapi BM25
similarity computation. Based on the similarity score, a large number of answers are selected for the given
question. The selected answers are then clustered with k-means clustering, which eliminates incorrect answer
selection. The hierarchy formed by the algorithm simplifies the process of answer selection. From the
hierarchy, the optimized result is obtained with firefly optimization.